Python - All a Scientist Needs

Author

  • Julius B. Lucks
Abstract

Any cutting-edge scientific research project requires a myriad of computational tools for data generation, management, analysis and visualization. Python is a flexible and extensible scientific programming platform that offered the perfect solution in our recent comparative genomics investigation [1]. In this paper, we discuss the challenges of this project, and how the combined power of Biopython [2], Matplotlib [3] and SWIG [4] was used for the required computational tasks. We finish by discussing how Python goes beyond being a convenient programming language, and promotes good scientific practice by enabling clean code, integration with professional programming techniques such as unit testing, and strong data provenance.

1 The Scientist's Dilemma

A typical scientific research project requires a variety of computational tasks to be performed. At the very heart of every investigation is the generation of data to test hypotheses. An experimental physicist builds instruments to collect light scattering data; a crystallographer collects X-ray diffraction data; a biologist collects fluorescence intensity data for reporter genes, or DNA sequence data for these genes; and a computational researcher writes programs to generate simulation data. All of these scientists use computer programs to control instruments or perform simulations, and to collect and manage data in an electronic format.

Once data is collected, the next task is to analyze it in the context of hypothesis-driven models that help scientists understand the phenomena they are studying. In the case of light or X-ray scattering data, there is a well-proven physical theory that is used to process the data and calculate the observed structure function of the material being studied [5]. This structure function is then compared to predictions made by the hypotheses being tested.
In the case of biological reporter gene data, light intensity is matched up with phenotypic traits or DNA sequences, and statistically analyzed for trends that might explain the observed patterns. As these examples illustrate, across science, the original raw data of each investigation is extensively processed by computational programs in an effort to understand the underlying phenomena.

Visualization tools that create a variety of scientific plots are often preferred both for troubleshooting ongoing experiments and for creating publication-quality plots and charts. These plots and charts are often the final product of a scientific investigation, in the form of data-rich graphics that demonstrate the truth of a hypothesis compared to its alternatives [6].

Unfortunately, all too often scientists resort to a grab-bag of tools to perform these varied computational tasks. For physicists and theoretical chemists, it is common to use C or FORTRAN to generate simulation data, and C code is used to control experimental apparatus; for biologists, Perl is the language of choice to manipulate DNA sequence data [7]. Data analysis is performed in separate, external software packages such as Matlab or Mathematica for equation solving [8, 9], or Stata, SPSS or R for statistical calculations [10, 11, 12]. Furthermore, separate data visualization packages can be used, making the scientific programming toolset extremely varied.

Such a mixed bag of tools is an inadequate solution for a variety of reasons. From a computational perspective, most of these tools cannot be pipelined easily, which necessitates many manual steps or excessive glue code that most scientists are not trained to write. Far more important than the inconvenience of gluing these tools together is the extreme burden placed on the scientist in terms of data management.

arXiv:0803.1838v1 [q-bio.QM] 12 Mar 2008
In complicated systems, there is often a plethora of different data files in several different formats residing at many different locations. Most tools do not produce adequate metadata for these files, and scientists typically fall back on cryptic file naming schemes to indicate what type of data the files contain and how it was generated. Such complications can easily lead to mistakes. This in turn yields data provenance that is poor at best, even though provenance is of utmost importance in scientific studies, where data integrity is the foundation of every conclusion reached and every fact established. Furthermore, when data files are manually moved from tool to tool, it is not clear whether an error is due to a program error or to human error in using the wrong file. Analyses can only be repeated by following workflows that have to be manually recorded in a paper or electronic lab notebook. This practice makes steps easy to forget, and hard to pass on to future generations of scientists or to current peers trying to reproduce scientific results.

The Python programming language and associated community tools [13] can help scientists overcome some of these problems by providing a general scientific programming platform that allows scientists to generate, analyze, visualize and manage their data within the same computational framework. Python can be used to generate simulation data, or to control instrumentation to capture data. Data analysis can be accomplished in the same way, and there are graphics libraries that can produce scientific charts and graphs. Furthermore, Python code can be used to glue all of these Python solutions together, so that visualization code resides alongside the code that generates the data it is applied to. This allows streamlined generation of data and its analysis, which makes data management feasible.
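The idea of keeping data generation, analysis and bookkeeping in one framework can be illustrated with a minimal sketch. It is not code from the study: the function names (`generate_data`, `analyze`, `run_workflow`), the toy Gaussian "simulation" and the layout of the record are all assumptions made for illustration, and only the Python standard library is used.

```python
import hashlib
import json
import random
import statistics


def generate_data(seed, n=100):
    """Generate toy 'simulation' data (a stand-in for a real simulation or instrument)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def analyze(data):
    """Reduce the raw data to a few summary statistics."""
    return {"mean": statistics.mean(data), "stdev": statistics.stdev(data)}


def run_workflow(seed=42, n=100):
    """Run generation and analysis together and record how the result was produced."""
    data = generate_data(seed, n)
    # A hash of the raw data lets a later reader check which data the results came from.
    digest = hashlib.sha256(json.dumps(data).encode("utf-8")).hexdigest()
    return {
        "parameters": {"seed": seed, "n": n},
        "steps": ["generate_data", "analyze"],
        "data_sha256": digest,
        "results": analyze(data),
    }


if __name__ == "__main__":
    # The printed record could be stored next to the results instead of a cryptic file name.
    print(json.dumps(run_workflow(), indent=2))
```

Because the parameters, the ordered steps and a hash of the raw data travel with the results, rerunning `run_workflow` with the same seed reproduces the same record, which is the kind of repeatability that a manually kept notebook cannot guarantee.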
Most importantly, such a uniform tool set allows the scientist to record the steps used in data workflows in Python code itself, allowing automatic provenance tracking.

In this paper, we outline a recent comparative genomics case study where Python and associated community libraries were used as a complete scientific programming platform. We introduce several specific Python libraries and tools, and show how they were used to facilitate input of standardized biological data, create scientific plots, and provide solutions to speed bottlenecks in the code. Throughout, we provide detailed tutorial-style examples of how these tools were used, and point to resources for further reading on these topics. We conclude with ideas about how Python promotes good scientific programming practices, and tips for scientists interested in learning more about Python.

2 A Comparative Genomics Case Study

Recently we performed a comparative genomics study of the genomic DNA sequences of the 74 sequenced bacteriophages that infect E. coli, P. aeruginosa, or L. lactis [1]. Bacteriophages are viruses that infect bacteria. The DNA sequences of these bacteriophages contain important clues as to how the relationship with their host has shaped their evolution. Each virus that we examined has a DNA genome that is a long strand built from four nucleotides: adenine (A), thymine (T), cytosine (C), and guanine (G). The specific sequences of A's, T's, C's and G's encode proteins that the virus uses to take over the host bacterium and create more copies of itself. Each protein is encoded in a specific region of the genomic DNA called a gene.

Proteins are made up of linear strings of 20 amino acids. There are 4 bases encoding 20 amino acids, and the translation table that governs the encoding, called the genetic code, is composed of three-base triplets called codons. Each codon encodes a specific amino acid.
Since there are 64 possible codons, and only 20 amino acids, there is a large degeneracy in the genetic code. For more information on the genetic code, and the biological process of converting DNA sequences into proteins, see [14]. Because of this degeneracy, each protein can be 'spelled' as a sequence of codons in many possible ways. The particular sequence of codons used to spell a given protein in a gene is called the gene's 'codon usage'. As we found in [1], bacteriophage genomes favor certain codon spellings of genes over the other possibilities. The primary question of our investigation was: do the observed spellings of the bacteriophage genome shed light on the relationship between the bacteriophage and its host [1]?
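To make the notions of codons and codon usage concrete, the following is a minimal sketch that enumerates the 64 possible codons and counts how often each one appears in a gene. It uses only the Python standard library rather than the Biopython-based code of the study, and the helper names (`split_codons`, `codon_usage`) and the short `toy_gene` sequence are hypothetical, chosen purely for illustration.

```python
from collections import Counter
from itertools import product

BASES = "ATCG"

# Every possible three-base codon: 4 ** 3 = 64 in total.
ALL_CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]


def split_codons(gene):
    """Split a gene sequence into consecutive, non-overlapping codons."""
    gene = gene.upper()
    if len(gene) % 3 != 0:
        raise ValueError("gene length must be a multiple of 3")
    return [gene[i:i + 3] for i in range(0, len(gene), 3)]


def codon_usage(gene):
    """Count how many times each codon is used to 'spell' the gene."""
    return Counter(split_codons(gene))


# Hypothetical toy gene (not data from the study). In the standard genetic
# code, GCT, GCC, GCA and GCG all encode alanine, so the four alanines in
# this gene could have been spelled in several different ways.
toy_gene = "ATGGCTGCCGCTGCATAA"
usage = codon_usage(toy_gene)

if __name__ == "__main__":
    print(len(ALL_CODONS), "possible codons")
    for codon, count in sorted(usage.items()):
        print(codon, count)
```

Here `usage["GCT"]` is 2 while `usage["GCC"]` and `usage["GCA"]` are each 1, so this toy gene already shows an uneven choice among synonymous alanine codons. Comparing such counts across many genes and genomes is the kind of question the study asks.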




Publication date: 2008